I didn’t collaborate with anyone on this assignment.
I trained 7 models, all designed to predict Credit. The models involve an increasing number of coefficients, starting with the simplest (just the intercept) and ending with model7, which has 6 predictors. I will first fit all models on the training data (which has 20 observations) and then apply them to the test data (which has 380 observations). Then I will compute RMSE values for all models on both the training and test sets, and compare them to identify and explain any trends.
First, let’s fit all 7 models on the training dataset.
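Since the original fitting code is not shown, here is a minimal sketch of the procedure using a synthetic stand-in for the Credit data (the observation counts match the write-up, but the predictors, coefficients, and noise level are hypothetical). Each of the 7 nested models adds one more predictor to the previous one, and RMSE is computed on both splits:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Credit data (hypothetical numbers):
# 400 observations, 6 numeric predictors, signal in the first three only.
n, p = 400, 6
X = rng.normal(size=(n, p))
y = 300 + 80 * X[:, 0] - 50 * X[:, 1] + 30 * X[:, 2] + rng.normal(scale=40, size=n)

# Train on 20 observations and test on the remaining 380, as in the write-up.
idx = rng.permutation(n)
train, test = idx[:20], idx[20:]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Fit the 7 nested models: intercept only, then add one predictor at a time.
results = []
for k in range(p + 1):
    Xtr = np.column_stack([np.ones(len(train)), X[train][:, :k]])
    Xte = np.column_stack([np.ones(len(test)), X[test][:, :k]])
    beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
    results.append((k, rmse(y[train], Xtr @ beta), rmse(y[test], Xte @ beta)))

for k, tr, te in results:
    print(f"{k} predictors: train RMSE = {tr:7.2f}, test RMSE = {te:7.2f}")
```

Because the models are nested least-squares fits, the training RMSE can never increase as predictors are added; the interesting behavior is in the test RMSE.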
Compare and contrast the two curves and hypothesize as to the root cause of any differences.
We can see a number of trends here:
The RMSE values for the very simple models (just the intercept, or the intercept plus one predictor) are high on both the training and test sets. These models are underfit: they are too simple to capture much of the signal.
The RMSE values for both the training and test data drop sharply once two or three predictors are used. At this point the fitted planes capture much of the signal without yet overfitting.
Beyond that, however, the test RMSE climbs again while the training RMSE keeps decreasing. At this point we are overfitting: we mistake noise for signal, treating random variation in the training data as if it reflected the true underlying function. In multiple regression this is especially problematic when the ratio of predictors to observations is high, since the least-squares problem becomes ill-conditioned and the fit chases the idiosyncrasies of the training set. The result is lower out-of-sample predictive accuracy.
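The predictor-to-observation problem can be made concrete with a small sketch (hypothetical sizes, not the actual Credit data): with as many coefficients as training points, least squares can drive the training RMSE to essentially zero even when the response is pure noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# 20 training points, 19 predictors plus an intercept = 20 coefficients.
n_train, p = 20, 19
X = rng.normal(size=(n_train, p))
y = rng.normal(size=n_train)  # pure noise: there is no signal at all

Xd = np.column_stack([np.ones(n_train), X])  # square, generically full rank
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
train_rmse = float(np.sqrt(np.mean((y - Xd @ beta) ** 2)))
print(train_rmse)  # essentially zero: the model interpolates the noise exactly
```

A training RMSE of zero here tells us nothing about out-of-sample performance; the model has simply memorized the noise.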
To understand this conceptually, we can ground it in the bias-variance tradeoff. As the model becomes more flexible we lower its bias but increase its variance. The expected test MSE decomposes into three terms: the variance of the fitted model, its squared bias, and the irreducible error. Because bias and variance generally move in opposite directions as flexibility increases, we must find the right balance, or ‘tradeoff’, between them.
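For reference, the standard decomposition behind this tradeoff (with $\hat f$ the fitted model, $(x_0, y_0)$ a test point, and $\varepsilon$ the irreducible noise) is:

```latex
E\!\left[\big(y_0 - \hat f(x_0)\big)^2\right]
  = \mathrm{Var}\!\big(\hat f(x_0)\big)
  + \big[\mathrm{Bias}\big(\hat f(x_0)\big)\big]^2
  + \mathrm{Var}(\varepsilon)
```

The first two terms depend on model flexibility and trade off against each other; the last term is a floor that no model can get below.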
Repeat the whole process, but let credit_train be a random sample of size 380 from credit instead of 20. Now compare and contrast this graph with the one above and hypothesize as to the root cause of any differences.
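The swapped split can be sketched by reusing the nested-model procedure on both partitions of the same synthetic stand-in data (hypothetical numbers again, since the real Credit data is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same synthetic stand-in as before: 400 observations, 6 predictors,
# signal in the first three only (hypothetical numbers).
n, p = 400, 6
X = rng.normal(size=(n, p))
y = 300 + 80 * X[:, 0] - 50 * X[:, 1] + 30 * X[:, 2] + rng.normal(scale=40, size=n)

def fit_nested(train, test):
    """Return (train RMSE, test RMSE) for each of the 7 nested models."""
    out = []
    for k in range(p + 1):
        Xtr = np.column_stack([np.ones(len(train)), X[train][:, :k]])
        Xte = np.column_stack([np.ones(len(test)), X[test][:, :k]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        out.append((float(np.sqrt(np.mean((y[train] - Xtr @ beta) ** 2))),
                    float(np.sqrt(np.mean((y[test] - Xte @ beta) ** 2)))))
    return out

idx = rng.permutation(n)
small = fit_nested(idx[:20], idx[20:])  # 20 train / 380 test, as before
large = fit_nested(idx[20:], idx[:20])  # 380 train / 20 test, this scenario

# Train/test gap at the full 6-predictor model under each split.
print("gap with 20-obs train: ", abs(small[6][1] - small[6][0]))
print("gap with 380-obs train:", abs(large[6][1] - large[6][0]))
```

With 380 training observations the coefficient estimates are far more stable, so the train and test RMSE curves should track each other much more closely; the 20-observation test set, however, makes the test RMSE line noisier.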
In this case, we see a similar initial pattern: the RMSE scores drop dramatically once the model includes two or three predictors. With many more training observations, the fit cannot tailor itself to noise as easily, so the two RMSE lines sit much closer together. The test RMSE line, now computed from only 20 observations, is much more fickle and ‘jumpy’ than in the previous scenario. That said, we still see a positive slope once the number of coefficients moves past its optimal value, indicating that we are still overfitting the training set, though far less severely.